---
title: "Introduction to Lexical Similarity"
author: "Dattatreya Majumdar"
date: "2025"
params:
title: "Introduction to Lexical Similarity"
author: "Dattatreya Majumdar"
year: "2025"
version: "2025.04.02"
url: "https://ladal.edu.au/tutorials/lexsim/lexsim.html"
institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia"
description: "This tutorial introduces lexical similarity analysis in R, covering string distance metrics, edit distance, and methods for comparing and clustering documents and words based on their surface forms. It is aimed at researchers in corpus linguistics, historical linguistics, and computational linguistics who need to quantify similarity between texts or lexical items."
doi: "10.5281/zenodo.19332903"
format:
html:
toc: true
toc-depth: 4
code-fold: show
code-tools: true
theme: cosmo
---
{ width=100% }
# Introduction{-}
This tutorial introduces Text Similarity [see @zahrotun2016comparison; @li2013distance], i.e. how close or similar two pieces of text are with respect to either their use of words or characters (lexical similarity) or in terms of meaning (semantic similarity). The tutorial is aimed at beginners and intermediate users of R and showcases how to assess the similarity of texts in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods for assessing text similarity.
*Lexical Similarity* provides a measure of the similarity of two texts based on the intersection of their word sets (of the same or different languages). A lexical similarity of 1 suggests that there is complete overlap between the vocabularies, while a score of 0 suggests that the two texts share no words. There are several different ways of evaluating lexical similarity, such as Jaccard Similarity, Cosine Similarity, and Levenshtein Distance.
*Semantic Similarity*, on the other hand, measures the similarity between two texts based on their meaning rather than their surface form. Semantic similarity is highly useful for summarising texts and extracting key attributes from large documents or document collections. It can be evaluated using methods such as *Latent Semantic Analysis* (LSA), *Normalised Google Distance* (NGD), and *Salient Semantic Analysis* (SSA).
In this tutorial we focus primarily on Lexical Similarity. We begin with a brief overview of the relevant concepts and then show how the different measures can be implemented in R.
## Jaccard Similarity{-}
The Jaccard similarity of two texts is defined as the size of the intersection of their word sets divided by the size of their union. In other words, it is the number of shared words over the total number of distinct words in the two texts or documents. The Jaccard similarity of two documents ranges from 0 to 1, where 0 signifies no overlap and 1 signifies complete overlap. The mathematical representation of the Jaccard Similarity is shown below:
\begin{equation}
J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B |} = \frac{|A \bigcap B|}{|A| + |B| - |A \bigcap B|}
\end{equation}
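To make the formula concrete, the Jaccard similarity can be computed directly with base R set operations. The two toy word vectors below are invented purely for illustration:

```{r jacbase}
# two toy texts represented as word sets (illustrative example)
a <- c("the", "quick", "brown", "fox")
b <- c("the", "lazy", "brown", "dog")
# |A intersect B| / |A union B|
jaccard <- length(intersect(a, b)) / length(union(a, b))
jaccard
```

The two texts share two words ("the" and "brown") out of six distinct words overall, so the Jaccard similarity is 2/6, roughly 0.33.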
## Cosine Similarity{-}
In the case of cosine similarity, the two documents are represented as vectors in an n-dimensional vector space, with each dimension corresponding to a word. The cosine similarity metric then measures the cosine of the angle between these two vectors. For term-count vectors, the cosine similarity ranges from 0 to 1: a value closer to 0 indicates less similarity, whereas a score closer to 1 indicates more similarity. The mathematical representation of the Cosine Similarity is shown below:
\begin{equation}
similarity = cos(\theta) = \frac{A \cdot B}{||A|| ||B||} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}
\end{equation}
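The same calculation can be written out in base R: represent each document as a vector of term counts over a shared vocabulary and apply the formula directly. The counts below are made up for illustration:

```{r cosbase}
# term-count vectors for two toy documents over a shared vocabulary
# vocabulary: the, quick, brown, fox, lazy, dog
A <- c(2, 1, 1, 1, 0, 0)
B <- c(2, 0, 1, 0, 1, 1)
# dot product divided by the product of the vector norms
cos_sim <- sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
cos_sim
```

Here the dot product is 5 and both vectors have norm sqrt(7), giving a cosine similarity of 5/7, roughly 0.71.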
## Levenshtein Distance{-}
Levenshtein distance comparison is generally carried out between two words. It determines the minimum number of single-character edits required to change one word into the other: the higher the number of edits, the more different the two words are. An edit is either an insertion of a character, a deletion of a character, or a replacement of a character. For two words *a* and *b* with lengths *i* and *j*, the Levenshtein distance is defined as follows:
\begin{equation}
lev_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \quad \text{if } \min(i,j) = 0,\\
\min \begin{cases}
lev_{a,b}(i-1,j)+1 \\
lev_{a,b}(i, j-1)+1 \\
lev_{a,b}(i-1,j-1)+1_{(a_{i} \neq b_{j})}
\end{cases} & \quad \text{otherwise.}
\end{cases}
\end{equation}
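The recursion above is usually computed by filling a dynamic-programming table. The base R function below is a minimal sketch of that approach (written for this tutorial, not part of any package); base R's `adist()` implements the same generalised Levenshtein distance and can be used to cross-check the result:

```{r levbase}
# minimal dynamic-programming implementation of the Levenshtein distance
lev <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  d <- matrix(0L, nrow = length(a) + 1, ncol = length(b) + 1)
  d[, 1] <- 0:length(a)   # cost of deleting i characters
  d[1, ] <- 0:length(b)   # cost of inserting j characters
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- as.integer(a[i] != b[j])   # 0 if the characters match
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # substitution
    }
  }
  d[length(a) + 1, length(b) + 1]
}
lev("Marta", "Martha")    # 1: insert an h
adist("Marta", "Martha")  # base R equivalent
```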
## Preparation and session set up{-}
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it [here](/tutorials/intror/intror.html). For this tutorial, we need to install certain *packages* from an R *library* so that the scripts shown below run without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed these packages, you can skip this section. To install the necessary packages, simply run the following code; it may take some time (between 1 and 5 minutes), so do not worry if it does not finish right away.
```{r prep1, echo=T, eval = F, message=FALSE, warning=FALSE}
# set options
options(stringsAsFactors = F)
# install libraries
install.packages("stringdist")
install.packages("hashr")
install.packages("tidyverse")
```
Now that we have installed the packages, we activate them as shown below.
```{r prep2, message=FALSE, warning=FALSE, class.source='klippy'}
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(stringdist)
library(hashr)
library(tidyverse)
```
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
# Measuring Similarity in R{-}
To evaluate the similarity scores and edit distances for the methods discussed above, we have installed the *stringdist* package and will primarily be using two of its functions: *stringdist* and *seq_dist*. We also use the *hashr* package so that the Jaccard and cosine similarity are evaluated word-wise instead of letter-wise: each sentence is tokenised, and the resulting words are hashed so that the sentences are transformed into sequences of integers. For the Jaccard and the Cosine similarity we will be using the same pair of texts, whereas for the Levenshtein edit distance we will take 3 pairs of words to illustrate *insert*, *delete* and *replace* operations.
```{r librarydata, echo=T, eval = T, message=FALSE, warning=FALSE}
text1 <- "The quick brown fox jumped over the wall"
text2 <- "The fast brown fox leaped over the wall"
insert_ex <- c("Marta", "Martha")
del_ex <- c("Genome", "Gnome")
rep_ex <- c("Tim", "Tom")
```
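To see what the hashing step does, the short sketch below tokenises a sentence and hashes the tokens. The exact integer values are implementation-dependent, so only the structure of the output matters here:

```{r hashdemo}
library(hashr)
sent <- "The quick brown fox"
tokens <- strsplit(sent, "\\s+")  # tokenise on whitespace
hash(tokens)                      # each word becomes an integer
```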
## Jaccard Similarity{-}
```{r jac}
# seq_dist() with method = "jaccard" returns the Jaccard *distance*,
# so we subtract it from 1 to obtain the similarity; hash() turns the
# word tokens into integers so the comparison is word-wise
# (q = 2 compares pairs of adjacent words)
jac_sim_score <- 1 - seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "jaccard", q = 2)
print(paste0("The Jaccard similarity for the two texts is ", jac_sim_score))
```
## Cosine Similarity{-}
```{r cos}
# seq_dist() with method = "cosine" returns the cosine *distance*,
# so we subtract it from 1 to obtain the similarity
cos_sim_score <- 1 - seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "cosine", q = 2)
print(paste0("The Cosine similarity for the two texts is ", cos_sim_score))
```
## Levenshtein distance{-}
```{r le}
# Insert edit
ins_edit <- stringdist(insert_ex[1], insert_ex[2], method = "lv")
print(paste0("The insert edit distance for ", insert_ex[1], " and ", insert_ex[2], " is ", ins_edit))
# Delete edit
del_edit <- stringdist(del_ex[1], del_ex[2], method = "lv")
print(paste0("The delete edit distance for ", del_ex[1], " and ", del_ex[2], " is ", del_edit))
# Replace edit
rep_edit <- stringdist(rep_ex[1], rep_ex[2], method = "lv")
print(paste0("The replace edit distance for ", rep_ex[1], " and ", rep_ex[2], " is ", rep_edit))
```
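If a normalised score between 0 and 1 is preferred over a raw edit count, the *stringdist* package also provides *stringsim*, which converts the edit distance into a similarity (1 minus the distance divided by the length of the longer string):

```{r levsim}
library(stringdist)
stringsim("Marta", "Martha", method = "lv")  # 1 - 1/6, i.e. about 0.83
stringsim("Tim", "Tom", method = "lv")       # 1 - 1/3, i.e. about 0.67
```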
# Concluding remarks{-}
As shown above, the Jaccard and the Cosine similarity scores for the same pair of texts differ, which is important to keep in mind when choosing a measure of similarity. The differences arise primarily because the Jaccard similarity only takes the unique words in the two texts into consideration, whereas the Cosine similarity approach also takes word frequencies, i.e. the lengths of the vectors, into consideration. For the Levenshtein edit distance, the examples above show that in the first case we have to insert an extra *h*, in the second we have to delete an *e*, and in the last case we need to replace *i* with *o*. Thus, for all the pairs considered here the edit distance is 1.
# Citation & Session Info {-}
::: {.callout-note}
## Citation
```{r citation-callout, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, "), ",
"doi: ", params$doi, ".",
sep = ""
)
```
```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
tolower(gsub(" ", "", gsub(",.*", "", params$author))),
params$year,
tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat(" author = {", params$author, "},\n", sep = "")
cat(" title = {", params$title, "},\n", sep = "")
cat(" year = {", params$year, "},\n", sep = "")
cat(" note = {", params$url, "},\n", sep = "")
cat(" organization = {", params$institution, "},\n", sep = "")
cat(" edition = {", params$version, "},\n", sep = "")
cat(" doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::
```{r fin}
sessionInfo()
```
::: {.callout-note}
## AI Transparency Statement
This tutorial was re-developed with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the `checkdown` quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
:::
[Back to top](#introduction)
[Back to HOME](/index.html)
# References{-}